In this section, we explore the dataset and try to generate some insights through various visualization methods. Because the dataset is large and now has 36 columns, we could produce as many plots as we want, but not all of them would be interesting or useful. We therefore show only the plots that caught our interest or looked strange to us.

Packages

library(tidyverse)
library(scales)
library(plotly)
library(gridExtra)
library(modelr)
library(tidytext)

Accident count

This dataset contains traffic accident records in 49 states. We can use a map to see the accident distribution from 2016 to 2019.

The 10 states with the highest accident counts are highlighted on the map. Later, in the modeling part, we will focus mainly on these 10 states.
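
As a rough illustration of how such a ranking could be computed, here is a minimal sketch on a toy table. The data frame `accidents` and the column name `State` are assumptions made for this example, not the exact code behind the map.

```r
library(dplyr)

# Toy stand-in for the accident table; `State` is an assumed column name.
accidents <- tibble(
  State = c("CA", "CA", "CA", "TX", "TX", "FL", "NY")
)

# Count records per state, sort in descending order, keep the top rows.
top_states <- accidents %>%
  count(State, name = "accident_count", sort = TRUE) %>%
  slice_head(n = 2)   # n = 10 would be used on the full data

top_states
```

The resulting state codes could then be used to filter the data before modeling.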

Distance affected by accidents

The “Distance” variable in the dataset is the length of road affected by the accident. We would like to know how distance relates to the severity level.
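
One simple way to look at this relationship is to summarise distance within each severity level. The sketch below uses made-up values and assumed column names (`Severity`, `Distance`); it only shows the shape of the computation.

```r
library(dplyr)

# Made-up values; column names are assumptions for this example.
accidents <- tibble(
  Severity = c(1, 2, 2, 3, 3, 4),
  Distance = c(0.0, 0.1, 0.3, 0.5, 1.5, 3.0)
)

# Mean and median affected distance per severity level.
distance_by_severity <- accidents %>%
  group_by(Severity) %>%
  summarise(
    n               = n(),
    mean_distance   = mean(Distance),
    median_distance = median(Distance),
    .groups         = "drop"
  )

distance_by_severity
```

Reporting the median next to the mean is a deliberate choice here, since a few very long affected stretches could dominate the mean.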

Accident count in each severity level

We expected the severity level distribution to be similar in each year, assuming traffic conditions did not change remarkably from 2016 to 2019.

But the result shows that from 2018 to 2019, severity level 2 has a sudden increase while level 3 has a sudden decrease, which seems a little strange.

Right now we cannot give a valid explanation for this result. Answering it may require further research, such as talking to someone from the traffic data source: if the rule for distinguishing level 2 from level 3 has changed since 2018, we have no way to confirm that ourselves.

One more thing to bear in mind: this dataset is seriously unbalanced across severity levels. Most of the accidents are classified as level 2 or level 3.
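
The year-by-year comparison of severity levels can be sketched as follows. The toy data and the column names `Year` and `Severity` are assumptions; the numbers are invented and only mimic a shift from level 3 to level 2.

```r
library(dplyr)

# Invented records; `Year` and `Severity` are assumed column names.
accidents <- tibble(
  Year     = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019),
  Severity = c(2, 3, 3, 3, 2, 2, 2, 3)
)

# Count per year and level, then convert to a share within each year.
severity_share <- accidents %>%
  count(Year, Severity) %>%
  group_by(Year) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

severity_share
```

Comparing shares rather than raw counts keeps the years comparable even when the total number of records differs from year to year.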

Accident count at different time scales

One interesting thing we find is that when we split the original time variable into several new variables, some patterns can be revealed by visualization.
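
A minimal sketch of such a split, using `lubridate`, is shown here. The column name `Start_Time` and the derived column names are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Three example timestamps; `Start_Time` is an assumed column name.
accidents <- tibble(
  Start_Time = ymd_hms(c(
    "2019-01-05 08:15:00",   # a Saturday
    "2019-08-12 07:40:00",   # a Monday
    "2019-08-14 16:55:00"    # a Wednesday
  ))
)

# Derive one column per time scale from the original timestamp.
accidents_time <- accidents %>%
  mutate(
    Year       = year(Start_Time),
    Month      = month(Start_Time),
    Weekday    = wday(Start_Time, week_start = 1),  # 1 = Monday, 7 = Sunday
    Hour       = hour(Start_Time),
    Is_Weekend = Weekday >= 6
  )

accidents_time
```

Each derived column can then serve as a grouping variable or plot axis on its own.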

From this plot, the first thing we can see is that the accident count increases noticeably after July and drops suddenly in January. From the bottom subplot, we can recognize the weekly pattern of accidents: more accidents happen on weekdays and fewer on weekends. Looking closely, the month also seems to have more impact on weekday accidents than on weekend accidents, because from August to December the weekday counts increase more than the weekend counts.

The weekly pattern is easy to explain: people are busier on weekdays, so there should be more vehicles on the road. As for the monthly pattern, we suppose it may be the result of the holiday season and many schools reopening. To answer this with certainty, further research is needed.

Also, the hourly pattern of accidents is worth mentioning too.

It seems most accidents happen in two intervals: 7am - 8am and 4pm - 5pm. When we look at the hourly patterns separately for weekdays and weekends, we notice that this result should be attributed to the weekday pattern, because 7am - 8am and 4pm - 5pm are when most people commute on weekdays. As for weekends, we can only conclude that most accidents happen during the daytime.
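
The weekday/weekend comparison of hourly counts can be sketched like this. The timestamps are invented, and `Start_Time`, `Day_Type` and `accident_count` are names assumed for the example.

```r
library(dplyr)
library(lubridate)

# Invented timestamps; `Start_Time` is an assumed column name.
accidents <- tibble(
  Start_Time = ymd_hms(c(
    "2019-08-12 07:10:00", "2019-08-12 07:45:00",  # Monday
    "2019-08-13 16:20:00",                         # Tuesday
    "2019-08-17 13:05:00", "2019-08-18 13:30:00",  # Saturday, Sunday
    "2019-08-18 21:00:00"                          # Sunday
  ))
)

# Count accidents per hour, separately for weekdays and weekends.
hourly <- accidents %>%
  mutate(
    Hour     = hour(Start_Time),
    Day_Type = if_else(wday(Start_Time, week_start = 1) >= 6,
                       "weekend", "weekday")
  ) %>%
  count(Day_Type, Hour, name = "accident_count")

# Hour with the most accidents for each day type.
peak_hours <- hourly %>%
  group_by(Day_Type) %>%
  slice_max(accident_count, n = 1, with_ties = FALSE) %>%
  ungroup()

peak_hours
```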

The impact of weather conditions on accident severity

Common sense suggests that weather conditions should have a great impact on accident severity. It is reasonable to think that severe accidents happen more often in bad weather and less severe ones more often on clear days. However, the visualization seems to contradict this.

Actually, when we plot the most common weather conditions under each severity level, the distribution looks similar in each level. Only level 1 differs noticeably: more level 1 accidents happen during clear weather. And we can see that more severe accidents (levels 3 and 4) also happen a lot on clear days.

So it seems the severity of an accident is not mainly affected by weather conditions. (Later, in the modeling part, when we analyze which predictors to include in the model, weather condition ranks in the middle, between the most important features and the least important ones.)
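
To make the comparison across severity levels concrete, the share of each weather condition within a level can be computed as in this sketch. The values are invented and the column names `Severity` and `Weather_Condition` are assumptions.

```r
library(dplyr)

# Invented records; column names are assumptions for this example.
accidents <- tibble(
  Severity = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Weather_Condition = c("Clear", "Clear",
                        "Clear", "Clear", "Rain", "Overcast",
                        "Clear", "Clear", "Rain",
                        "Clear", "Snow")
)

# Share of each weather condition within its severity level,
# keeping only the most common conditions per level.
weather_share <- accidents %>%
  count(Severity, Weather_Condition) %>%
  group_by(Severity) %>%
  mutate(share = n / sum(n)) %>%
  slice_max(share, n = 2, with_ties = FALSE) %>%
  ungroup()

weather_share
```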

Something to be careful about

As you can see, this dataset is very unbalanced across severity levels. So when we look for patterns between severity and other variables, we usually compare “proportion” instead of “count”. Once we use “count” as the basis for comparison, the large counts of severity levels 2 and 3 cover up the patterns of levels 1 and 4 (this happens in the fourth plot, but luckily we have the zoom-in tool).
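
The difference between the two comparison bases can be sketched with `ggplot2`. The toy data and the grouping variable `Day_Type` are assumptions chosen for illustration; any categorical variable could take its place.

```r
library(ggplot2)

# Invented, deliberately unbalanced data: level 2 dominates.
accidents <- data.frame(
  Severity = factor(c(1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4)),
  Day_Type = c("weekend",
               "weekday", "weekday", "weekday", "weekday", "weekend", "weekend",
               "weekday", "weekday", "weekday", "weekend",
               "weekend")
)

# Raw counts: the tall bars of the common levels dominate the plot.
p_count <- ggplot(accidents, aes(x = Severity, fill = Day_Type)) +
  geom_bar() +
  labs(y = "Count")

# Proportions: every bar is rescaled to 1, so rare levels stay readable.
p_prop <- ggplot(accidents, aes(x = Severity, fill = Day_Type)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(y = "Proportion within severity level")
```

Printing `p_count` and `p_prop` side by side shows how `position = "fill"` removes the effect of the unequal group sizes.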